DIFFERENTIAL PREDICTION OF FAA ACADEMY PERFORMANCE ON
THE BASIS OF GENDER AND WRITTEN AIR TRAFFIC CONTROL
SPECIALIST APTITUDE TEST SCORES


The Federal Aviation Administration (FAA), in its 1993 Diversity Plan, made a commitment to attract, retain, develop, and manage a diverse work force that visibly reflected the American population at large by the year 2000. Achieving this goal will require substantial changes in the demographic profile of the Air Traffic Control Specialist (ATCS) occupation, the single largest (17,000) and most publicly visible occupational group in the agency. Air traffic control is a career field in which female workers have been historically under-represented relative to the American population at large. Entry into the occupation has been determined since 1981 by applicant performance on a written aptitude test battery administered by the US Office of Personnel Management (OPM; Aul, 1991). This test battery emphasized the organization, definition, and manipulation of the perceptual field through verbal and numeric reasoning (Harris, 1986). Yet, it is exactly such a test battery of cognitive abilities that may have been an inadvertent device for the exclusion of women from this traditionally male occupation.

Our purpose in this paper was to examine the technical fairness of the written ATCS aptitude test battery as the first step toward assessing to what degree, if any, the battery may have served as an “engine of exclusion” (Seymour, 1988) of women from the ATCS occupation. By technical fairness, we are referring to the regression model of test bias for which there is a reasonable professional consensus, as embodied in the 1985 Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education), rather than a socially constructed standard regarding test use (Sackett & Wilk, 1994; Gottfredson, 1994). Technical fairness in this sense, and under the Uniform Guidelines on Employee Selection Procedures (29 CFR 1607), encompasses two issues. First, the impact on protected groups arising from use of a particular cut score on the predictor must be evaluated. A selection rate for any protected group that is less than four-fifths (4/5 or 80%) of that of the majority group will “... generally be regarded by the Federal enforcement agencies as evidence of adverse impact” (29 CFR 1607.4.D). Second, where use of a selection procedure results in adverse impact, the Uniform Guidelines require that the test user evaluate the degree to which differential predictions of future job performance are made from selection test scores by subgroup (29 CFR 1607.14.B.(8).(b)). This study investigated the technical fairness of the ATCS written aptitude test battery toward women from two perspectives: adverse impact and differential prediction.

ADVERSE  IMPACT  ANALYSIS 

Previous research on written ATCS selection tests suggested that mean score differences by gender were insignificant (Rock, Dailey, Ozur, Boone, & Pickrel, 1984a, p. 476) and that, overall, “the evidence for adverse impact against women based on this sample was marginal, at best” (Rock, Dailey, Ozur, Boone, & Pickrel, 1984b, p. 507). This conclusion was based primarily on results of their 1984 study in which 57% of men (n = 3835) passed the screen in comparison to 45% of women (n = 1473). The adverse impact ratio in this case was 0.78 rather than the 0.80 required under the “four-fifths rule of thumb.”

In the present study, we hypothesized that the composite of scores earned on the written ATCS aptitude test battery, as used by OPM to determine eligibility for employment, had no adverse impact on women applicants. The composite of test scores would be considered technically fair if there was no adverse impact arising from its use as a personnel selection device.

METHOD

Sample 

The adverse impact analysis was based on determinations of eligibility for employment made by OPM for job applicants on the basis of a composite of two written test scores. Over 200,000 job applicants have taken the OPM written ATCS test battery since 1981. Records for 170,578 applicants with complete test scores were available in the data base. These records, as provided by OPM, included test raw scores, gender, education, and a determination of eligibility for employment based on test scores; racial identification data were not available. Demographic characteristics for this reference population of applicants are presented in Table 1. Gender (SEX), as indicated by OPM values, was recoded as 0 for males, and 1 for females.

Selection  test  scores 

The selection test score used by OPM to determine eligibility for employment was a composite of scores earned on two written ATCS aptitude tests: the Multiplex Controller Aptitude Test (MCAT), and the Abstract Reasoning Test (ABSR). The development, psychometric characteristics, and validity of these written ATCS aptitude tests have been extensively described elsewhere (Brokaw, 1984; Collins, Boone, & VanDeventer, 1984; Manning, 1991; Sells, Dailey, & Pickrel, 1984). Scoring of the tests was done initially by summing the MCAT (weighted 2) and ABSR (weighted 1) scores. The resulting weighted scores were then transformed via an OPM test score transmutation table into the Transmuted Composite Score (TMC). About half of all applicants were expected to score at or above the mean on this composite (Rock, Dailey, Ozur, Boone, & Pickrel, 1984a). Applicants with 3 years of general experience, 4 years of college, or any combination of education and experience equating to 3 years of general experience and without prior aviation experience, were required to earn a TMC of at least 75.1 to be eligible for employment. Applicants with specific air traffic control-related aviation experience, or 4 years of college plus 1 year of graduate study, were eligible for employment if they earned a TMC of at least 70. In other words, a cut score of 75.1 or 70 on TMC, depending on applicant background, was used to determine eligibility for employment. Applicants not meeting these criteria were ineligible for employment as controllers. The determination of eligibility for employment was made by OPM. Codes indicating that an applicant had either failed the test (‘IA’) or scored too low for consideration (‘IS’), based on TMC, were recoded as test failures. All other ineligibility codes were recoded as “other ineligible,” and codes indicating eligibility were recoded as “eligible” for employment. The adverse impact analysis was based on this eligibility variable.

Procedure 

The adverse impact analysis was conducted in two steps. First, TMC distributions were analyzed by gender; a t-test was used to evaluate mean score differences. Second, selection rates on the basis of eligibility codes by gender were evaluated. The proportion of applicants determined to be eligible on the basis of their test scores was compared to the proportion ruled as ineligible on the basis of test scores; applicants determined to be ineligible on any other basis (e.g., age, salary requirements, experience, or education) were excluded from the analysis. Fisher’s Z test was used to statistically compare selection rates.

RESULTS 

Group  differences 

Group differences in predictor scores by gender are presented in Table 2. Males earned significantly higher mean TMC scores (M = 74.44, SD = 14.17) than females (M = 69.32, SD = 14.37; t(170,576) = 61.75, p < .001). The distribution of TMC by sex in the reference population of applicants is illustrated in Figure 1. The standardized effect size (d) for gender on TMC scores is 0.35 SD, corresponding to a small to medium effect size (Cohen, 1988). This contrasts with previous research suggesting that mean differences in TMC were insignificant (Rock, Dailey, Ozur, Boone, & Pickrel, 1984b). Mean score differences might be expected to translate into differences in selection rates by gender. In the ATCS selection process, only candidates scoring at or above the average TMC were eligible for employment.
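
As a rough check on the group comparison above, the following sketch recomputes the standardized mean difference and the t statistic from the summary statistics reported in Table 2. It assumes a pooled-variance (equal variances) t-test and a pooled standard deviation for d; the paper does not state which variant was used, and the function names are illustrative only. Under these assumptions the values come out close to, but not identical with, the reported d = 0.35 and t = 61.75.

```python
import math

def pooled_sd(n1, sd1, n2, sd2):
    """Pooled standard deviation of two independent groups."""
    return math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2))

def cohens_d(m1, m2, sp):
    """Standardized mean difference (assumed: pooled SD as the denominator)."""
    return (m1 - m2) / sp

def pooled_t(m1, m2, sp, n1, n2):
    """Independent-samples t statistic under the equal-variance assumption."""
    return (m1 - m2) / (sp * math.sqrt(1.0 / n1 + 1.0 / n2))

# TMC summary statistics as reported in Table 2
n_m, mean_m, sd_m = 132708, 74.44, 14.17   # males
n_f, mean_f, sd_f = 37870, 69.32, 14.37    # females

sp = pooled_sd(n_m, sd_m, n_f, sd_f)
d = cohens_d(mean_m, mean_f, sp)
t = pooled_t(mean_m, mean_f, sp, n_m, n_f)
print(f"pooled SD = {sp:.2f}, d = {d:.2f}, t = {t:.2f}")
```

Small discrepancies from the published figures are to be expected because the inputs here are rounded to two decimals.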

Table 1

Demographic characteristics for reference population, all 1985-1992 FAA Academy entrants, and research sample

Characteristic          Reference population   FAA Academy entrants   Research sample
                        (N = 170,578)          (N = 14,392)           (N = 9,552)
Sex
  Male                  132,708                11,460                 7,935
  Female                 37,870                 2,932                 1,617
Race
  Asian/Pac-Island                                  91                    59
  American Indian                                  195                   131
  African American                                 819                   283
  Hispanic                                         525                   293
  White                                         12,366                 8,555
  Missing Data                                     396                   231
Education
  LT High School            404
  High School            28,147                 1,576                 1,046
  Some college           82,414                 7,750                 5,351
  Bachelor’s degree      54,583                 4,745                 3,033
  Advanced degree         3,934                   176                   116
  Missing Data            1,096                   145                     6
Age
  Mean                                           26.01                 25.77
  SD                                              2.99                  2.85

Notes: Racial identification and age data not available for reference population of all applicants.

Figure 1

Predictor (TMC) score distribution by gender for reference population of applicants (N = 170,578)

Selection ratio

The results of the adverse impact analysis by gender, based on OPM eligibility codes, are presented in Table 3. Approximately half (52.04%) of the 132,708 men were considered eligible for employment on the basis of their aptitude test scores. With a majority selection rate of about 50% and mean differences of .35 SD, we anticipated a selection rate for women of about 35 to 38%, based on Sackett and Wilk (1994). In fact, 38.46% of the 37,870 women were eligible for employment on the basis of the aptitude test scores. The proportion of women determined to be eligible for employment was significantly less than the proportion of men (Z = -46.63, p < .001); the ratio of female to male selection rates was .74. Using the 4/5ths rule of thumb of the Uniform Guidelines, it appeared that use of scores on the written ATCS aptitude test battery to determine eligibility for employment resulted in statistically significant adverse impact against female applicants.
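
The four-fifths comparison can be sketched from the counts in Table 3. The example below assumes that the Z statistic is a pooled two-proportion z-test; the paper names the test only as "Fisher's Z," so this is an approximation of the procedure rather than a confirmed reimplementation, and the function names are hypothetical. With the Table 3 counts it yields an impact ratio of about .74 and a Z of roughly -46.6.

```python
import math

def impact_ratio(selected_focal, n_focal, selected_ref, n_ref):
    """Ratio of the focal group's selection rate to the reference group's rate."""
    return (selected_focal / n_focal) / (selected_ref / n_ref)

def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-proportion z statistic for p1 - p2 (assumed test form)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    return (p1 - p2) / se

# Counts from Table 3: eligible applicants and all applicants, by sex
eligible_f, total_f = 14564, 37870
eligible_m, total_m = 69056, 132708

ratio = impact_ratio(eligible_f, total_f, eligible_m, total_m)
z = two_proportion_z(eligible_f, total_f, eligible_m, total_m)
flagged = ratio < 0.80  # four-fifths rule of thumb

print(f"impact ratio = {ratio:.2f}, Z = {z:.2f}, below four-fifths: {flagged}")
```

Note that the rates here use all applicants in each column as the denominator, which is how the percentages in Table 3 appear to have been computed.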

DIFFERENTIAL  PREDICTION  ANALYSIS 

Given the finding that there appeared to be adverse impact against women, the Uniform Guidelines (29 CFR 1607.14.B.(8).(b)) and Standards for Educational and Psychological Testing (Standard 1.20, p. 17) required an investigation of the relationship between test scores and job performance for evidence of differential prediction by subgroup. We hypothesized that there was no difference in the predictive validity of the test battery by gender.

METHOD 

Sample 

The differential prediction analysis was based on a sample of persons actually hired by the FAA on the basis of their aptitude test scores. Between October 1985 and January 1992, a total of 14,392 ATCS candidates entered the FAA Academy. The majority (11,405) had competed under civil service regulations for hire and were entering the Academy for the first time. Complete gender, racial identification, predictor, and criterion data were available for the research sample of 8,842 male and female students. There were 7,332 (82.9%) men and 1,510 (17.1%) women in the sample. Demographic information for all Academy entrants and the research sample is presented in Table 1. As with the reference applicant population, gender (SEX) was coded as 0 for males, and 1 for females.

Table 2

Mean predictor and criterion score differences by gender

Variable   Group      N         M       SD      SE      t          df
TMC        Males      132,708   74.44   14.17   0.039   61.75***   170,576
           Females     37,870   69.32   14.37   0.074
SCREEN     Males       10,252   72.75   11.70   0.116    9.51***    12,754
           Females      2,504   70.26   12.00   0.240

***p < .001

Table 3

Adverse impact analysis for reference population by gender, based on OPM eligibility codes

                              Sex
OPM Eligibility      Males        Females      Row totals
Eligible              69,056       14,564        83,620
                     (52.04%)     (38.46%)
Failed test           49,902       20,077        69,979
                     (37.60%)     (53.02%)
Other ineligible      13,750        3,229        16,979
                     (10.36%)      (8.53%)
Column totals        132,708       37,870       170,578

Measures

Predictors. TMC was used in our differential prediction analyses as the measure of candidate aptitude as it provided a measure of ability unadjusted for previous experience and/or military service. Descriptive statistics for the predictor scores are presented in Table 4 for the reference population of applicants, all Academy entrants, and the research sample.

Criterion. The criterion in the differential prediction analysis was performance in the FAA Academy initial ATCS training program, known as the ATCS Nonradar Screen (“the Screen”). Training may be used as a criterion measure where success in training is “properly measured,” and the relevance of the training can be demonstrated through comparison of training content to critical or important job behaviors or by showing that training measures are related to subsequent measures of job performance (29 CFR 1607.14.B.(3)). The Screen was originally established in response to recommendations made by the US Congressional House Committee on Government Operations (US Congress, 1976) to “... provide early and continued screening to insure the prompt elimination of unsuccessful trainees and relieve the regional facilities of much of this burden” (p. 13). The Screen was based upon a miniaturized training-testing-evaluation personnel selection model (Siegel, 1978, 1983; Siegel & Bergman, 1975) in which individuals with no prior knowledge of an occupation are trained and then assessed for their potential to succeed in the job. Performance in the Screen has been shown to predict subsequent performance in radar-based training 1 to 2 years after entry into the occupation (Broach & Manning, 1994) as well as completion of the rigorous on-the-job training sequence and certification as a qualified “full performance level” controller (Della Rocco, Manning, & Wing, 1990; Manning, Della Rocco, & Bryant, 1989).

Thirteen assessments of performance, including six classroom tests, observations of performance in six laboratory simulations of non-radar air traffic control, and a final written examination, were made during the Screen (Della Rocco, Manning, & Wing, 1990). The final summed composite score (SCREEN) was weighted 20% for academics, 60% for laboratory simulations, and 20% for the final examination. A minimum SCREEN score of 70 was required to pass the Academy program. This final composite score was the criterion measure in this study. Descriptive statistics for SCREEN scores are also presented in Table 4 for all Academy entrants and for the research sample.

Procedure 

The classical, regression-based model of test bias was used as our analytic framework to evaluate the degree to which the written ATCS test battery differentially predicted performance in the Screen. A step-down hierarchical multiple regression analysis (Lautenschlager & Mendoza, 1986) was used to evaluate test bias. The step-down approach overcomes the shortcomings of the various step-up procedures (Bartlett, Bobko, Mosier, & Hannan, 1978; Cohen & Cohen, 1975) by accounting for the various changes in the sum of squared error term incrementally, while at the same time ensuring more statistical power than the other methods (Lautenschlager & Mendoza, 1986). Step-down analysis assumes the null hypothesis that a common regression line provides the best least-squares fit to the data. The alternative is that a full model, including slope and intercept differences between groups, is required to provide a significantly better fit to the data.

Our step-down analysis was conducted as follows, using the SPSS (SPSS, Inc., 1989) regression procedure. First, SCREEN was regressed on TMC only (basic model). Second, the criterion was regressed on TMC, the dummy-coded group membership variable, and the cross-product of TMC and that dummy-coded variable (full model). This full model was tested against the basic model of criterion and predictor test only for an incremental change in R² (the goodness-of-fit index). A significant change in R² suggested potential bias and dictated that further testing for slope and/or intercept differences for the groups be done. Third, to test for slope differences between groups, SCREEN was regressed on TMC and the dummy-coded variable indicating group membership (group model), and compared to the full model. A significant increment in R², based on a comparison of the group to full model, implied different slopes. Fourth, if slope differences were found, then SCREEN was regressed on TMC and the cross-product

Table 4

Descriptive statistics for reference population of job applicants, all 1985-1992 FAA Academy entrants, and research sample

            Reference population (N = 170,578)   All Academy entrants (N = 14,392)    Research sample (N = 9,552)
Variable    M       SD      Min     Max          M       SD      Min     Max          M       SD      Min     Max
TMC         73.30   14.37   19.53   100.00       91.08    5.43   70.00   100.00       91.55    5.03   70.00   100.00
SEX          0.22    0.42                         0.20    0.40                         0.17    0.38
TMC_SEX     15.39   29.60    0.00   100.00       16.18   34.80    0.00   100.00       15.52   34.43    0.00   100.00
SCREEN                                           72.26   11.80   27.16    99.47       71.68   11.36   27.16    97.59

Notes: Screen score not applicable for reference population of job applicants.

of aptitude and group membership (cross-product model). The cross-product model was then compared to the full model; a significant change in R² indicated intercept, as well as slope, differences between groups. If no slope differences were found, then the group model was compared to the basic model; a significant change in R² indicated only intercept differences between groups. The general logic and associated SPSS syntax for the step-down hierarchical regression analysis are illustrated in Figure 2.
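
The decision sequence can be sketched in Python as follows. This is an illustration only, not the SPSS syntax used in the study: it assumes NumPy is available, uses simulated data, and follows the model comparisons listed in Table 7 (basic v. full, full v. group, full v. crossproduct, basic v. group). The critical values are large-sample approximations to F at alpha = .01 (about 4.61 for 2 numerator df and 6.63 for 1), and all function and variable names are hypothetical.

```python
import numpy as np

def r_squared(y, *predictors):
    """R-squared from an ordinary least-squares fit with an intercept."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

def delta_f(r2_big, r2_small, k_big, k_small, n):
    """F test for the R-squared change between two nested models."""
    return ((r2_big - r2_small) / (k_big - k_small)) / ((1.0 - r2_big) / (n - k_big - 1))

def step_down(y, x, g, crit_1df=6.63, crit_2df=4.61):
    """Classify differential prediction using the step-down sequence."""
    n, xg = len(y), x * g
    r2_basic = r_squared(y, x)          # predictor only
    r2_group = r_squared(y, x, g)       # predictor + group
    r2_cross = r_squared(y, x, xg)      # predictor + cross-product
    r2_full = r_squared(y, x, g, xg)    # predictor + group + cross-product
    if delta_f(r2_full, r2_basic, 3, 1, n) <= crit_2df:
        return "common regression line"
    if delta_f(r2_full, r2_group, 3, 2, n) > crit_1df:       # slopes differ
        if delta_f(r2_full, r2_cross, 3, 2, n) > crit_1df:   # intercepts too
            return "slope and intercept differences"
        return "slope differences only"
    if delta_f(r2_group, r2_basic, 2, 1, n) > crit_1df:
        return "intercept differences only"
    return "overall difference, source not isolated"

# Simulated data with a built-in intercept difference and a common slope
rng = np.random.default_rng(0)
n_cases = 4000
group = (rng.random(n_cases) < 0.2).astype(float)
aptitude = rng.normal(75.0, 14.0, n_cases)
criterion = 20.0 + 0.7 * aptitude - 3.0 * group + rng.normal(0.0, 10.0, n_cases)

verdict = step_down(criterion, aptitude, group)
print(verdict)
```

Applying `delta_f` to the rounded R² values in Table 7 gives only approximate agreement with the published ΔF values, since the R² figures there are reported to three decimals.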

Technical  feasibility 

Restriction in range, statistical power, and criterion bias are considerations in evaluating the technical feasibility of a test fairness investigation that must be explicitly considered under the Uniform Guidelines (29 CFR 1607.14.B.(8).(c) and (e); 29 CFR 1607.16.U). Both explicit and incidental restriction in range are recurrent problems in ATCS selection research, as evidenced by the sample sizes and descriptive statistics in Table 4. Variance in TMC for the research sample was explicitly restricted in range due to selection. Therefore, correlations between TMC and the SCREEN criterion were corrected with respect to the reference population of 170,578 applicants, using the formula presented by Ghiselli, Campbell, and Zedeck (1981, p. 299). Correlations between variables indicating gender and the criterion were incidentally restricted in range. These gender-criterion correlations, including the gender-by-predictor crossproduct to SCREEN correlation, were corrected with respect to the reference population of 170,578 applicants using the Ghiselli, Campbell, and Zedeck (1981, p. 304) formula for incidental range restriction. Finally, values for the population correlations between gender and the gender-predictor crossproduct were computed. The overall structure of the correlation matrix is described in Table 5; sample and corrected correlations are presented in Table 6. Sample correlations, without corrections for restriction in range, are presented in the lower left-hand corner, while corrected and population correlations are presented in the upper right-hand corner of the overall matrix. Separate differential prediction analyses were conducted on the basis of sample and corrected correlations, as required by the Uniform Guidelines (29 CFR 1607.15.B.(8)).
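
The correction for explicit restriction in range can be illustrated with the standard formula for direct selection on the predictor (often called Thorndike's Case II). That this is the formula on the cited page of Ghiselli, Campbell, and Zedeck (1981) is an assumption on our part; the function name is hypothetical. The inputs are the uncorrected TMC-SCREEN correlation from Table 6 and the TMC standard deviations for the reference population and research sample from Table 4.

```python
import math

def correct_explicit_restriction(r_restricted, sd_unrestricted, sd_restricted):
    """Correct a correlation for direct range restriction on the predictor.

    Assumed form: r_c = r*u / sqrt(1 - r^2 + r^2 * u^2),
    where u = SD(unrestricted) / SD(restricted).
    """
    u = sd_unrestricted / sd_restricted
    r2 = r_restricted ** 2
    return (r_restricted * u) / math.sqrt(1.0 - r2 + r2 * u ** 2)

# TMC-SCREEN sample correlation (Table 6) and TMC standard deviations (Table 4)
r_sample = 0.1844
sd_applicants = 14.37   # reference population
sd_sample = 5.03        # research sample

r_corrected = correct_explicit_restriction(r_sample, sd_applicants, sd_sample)
print(f"corrected r = {r_corrected:.4f}")
```

Under these assumptions the result agrees with the corrected TMC-SCREEN correlation of .4724 shown in Table 6. The incidental-restriction correction applied to the gender variables requires additional correlations and is not sketched here.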

Sample sizes in these analyses were clearly of sufficient size to provide more than enough power to detect even small statistical effects. We estimated the

Table 5

Correlation matrix structure for differential prediction analysis

            TMC     SEX     TMC_SEX   SCREEN
TMC                 r_p     r_p       r_e
SEX         r_s             r_p       r_i
TMC_SEX     r_s     r_s               r_i
SCREEN      r_s     r_s     r_s

Note: Sample correlation matrix structure shown below the diagonal, corrected matrix structure above the diagonal. r_s = sample correlation; r_p = population correlation, where population is reference population of all applicants; r_e = sample correlation corrected for explicit restriction in range, based on reference population of all applicants; r_i = sample correlation corrected for incidental restriction in range, based on reference population of all applicants.

Figure 2

Step-down hierarchical regression analysis logic and SPSS syntax

[Flowchart not reproduced: each model comparison branches on whether the change in R² is significant or nonsignificant.]

available statistical power using Cohen’s (1988) regression power tables (Table 9.3.1) for as many as 3 independent variables at an alpha of .01. The risk of a type II error (failing to find an effect that in fact was present) was very low, with a .98 probability of detecting even very small effect sizes (P < .01) with a sample of more than 8,000 cases.

Finally, as noted by Sackett and Wilk (1994), Lautenschlager and Mendoza (1986), as well as by the Uniform Guidelines (29 CFR 1607.16.U), the feasibility of an assessment of technical fairness depends upon the quality of the job-relevant criterion: If the criterion was systematically biased against women, for example, then the regression-based method could not be used to determine the presence or absence of differential prediction by subgroup. The distribution of criterion scores by gender is shown in Figure 3. Observed mean score differences in SCREEN by gender were about 0.23 SD for the research sample. These results are somewhat less than the estimated differences of .3 to .4 SD by race reported by Ford, Kraiger, and Schectman (1986). There are several possibilities for these observed differences by gender: these mean criterion differences may represent some degree of “systematic bias” against women; the seeming bias may have been confounded, at least in part, with differences attributable to selection; these differences may reflect true distinctions in performance that were incidental to sex; and the apparent bias may be due to differences in information processing strategies used by the sexes. However, the limits of the data available for this study did not permit a definitive evaluation of these alternatives. Therefore, we cannot, with certainty, claim an unbiased criterion. Yet, in accordance with the Uniform Guidelines, Screen score was properly measured, and related to subsequent organizationally valued outcomes. It was also no more biased than measures used in previous published selection test fairness studies. Therefore, it was an appropriate criterion in this assessment of technical fairness under the Uniform Guidelines.


RESULTS 

Without  corrections  for  restriction  in  range 

The adverse impact analysis suggested that use of TMC to determine eligibility for employment as an air traffic controller may have contributed to a situation of adverse impact against women. The focus of an evaluation of technical fairness, therefore, shifted to the degree to which the predictor score differentially predicted the criterion. Sample correlations, without corrections for restriction in range, are presented in the lower left-hand triangle of the matrix in Table 6. TMC was significantly correlated with final score in the Academy Screen (r = .1844, p < .001) and slightly with the predictor-group crossproduct (r = .0296, p < .01). Gender (SEX) was negatively correlated with the criterion SCREEN score (r = -.0847, p < .001), where gender was coded as 1 for females and 0 for males. The results of the differential prediction analysis using the step-down hierarchical regression analysis on the basis of the sample correlation matrix without any corrections for restriction in range are presented in Table 7. The null hypothesis that a common regression line provided the best fit was rejected in the first analysis, suggesting the presence of some degree of test bias. The increment in R² gained by using the full model (predictor, group membership, and crossproduct), rather than the basic model (predictor only), was significant (ΔR² = .008, ΔF = 38.60, p < .001). Next, the null hypothesis of same slopes by gender could not be rejected; the subgroup model (predictor and group membership) did not explain any less variance than the full model (ΔR² = 0, ΔF = 1.02, ns). Following the analytic logic illustrated in Figure 2, the basic and subgroup models were next compared to determine if the intercepts were different for men and women. The null hypothesis of same intercepts was rejected, with removal of SEX leading to a significant reduction in the amount of explained variance (ΔR² = -.008, ΔF = 76.18, p < .001). Overall, the results obtained with the uncorrected correlations indicated significant intercept differences, but no differences in slopes by gender.

Table 6

Sample and corrected correlation matrix for differential prediction analysis by gender

            TMC          SEX          TMC_SEX      SCREEN
TMC                      -0.1479***   -0.0361***    0.4724
SEX          0.0106                    0.9735***   -0.0661
TMC_SEX      0.0296**     0.9986***                -0.0408
SCREEN       0.1844***   -0.0877***   -0.0847***

Note: As described in Table 5, correlations between SCREEN and TMC, SEX, and TMC_SEX in the upper right-hand corner are corrected, and therefore, no significance tests are reported. **p < .01, ***p < .001


Figure 3

Criterion (SCREEN) score distribution by gender in research sample (N = 12,756)

[Figure not reproduced: horizontal axis SCREEN, scaled 0 to 100.]

Table 7

Results of step-down hierarchical regression analysis for test bias in research sample by gender on basis of correlation matrix without corrections for restriction in range

Analysis                      Model                   R²      ΔR²      ΔF         F
Basic v. Full: Overall bias   TMC                     0.037                       339.94***
                              TMC + SEX + TMC_SEX     0.045    0.008   38.60***   140.01***
Full v. Group: Slopes         TMC + SEX + TMC_SEX     0.045                       140.01***
                              TMC + SEX               0.045   -0.000    1.02      209.50***
Full v. Crossproduct (a)      TMC + SEX + TMC_SEX     N/A
                              TMC + TMC_SEX
Basic v. Group: Intercepts    TMC + SEX               0.045                       209.50***
                              TMC                     0.037   -0.008   76.18***   339.94***

Notes: (a) Full v. crossproduct model comparison not conducted. See Figure 2 for logic and flow of step-down hierarchical regression analysis. ***p < .001

With corrections for restriction in range

However, as shown in Table 4, the sample range of scores on the predictor was severely restricted; as a consequence, evidence based on those uncorrected correlations may be somewhat misleading as to the fairness of the predictor (29 CFR 1607.14.B.(8).(c)). Analyses based on correlations corrected for explicit and implicit restriction in range may provide a better assessment of the fairness of the OPM test battery with respect to the applicant population. Population and corrected correlations are presented in the upper right-hand triangle of the matrix in Table 6. The estimated population correlation between TMC and performance in the Academy Screen increased from .1844 to .4724 with correction for explicit restriction in range. After correcting for incidental restriction in range, the correlation between gender and SCREEN decreased to -.0661, as did the correlation between the crossproduct and SCREEN (-.0408). The results of the differential prediction analysis, using the step-down hierarchical regression analysis on the basis of the corrected correlations, are presented in Table 8.
The null hypothesis of a common regression line was
rejected, suggesting the presence of some degree of test
bias. The increment in R² associated with the full
model over the basic model was significant (ΔR² =
.0187, F = 117.76, p < .001). Next, the null hypothesis
of same slopes by gender was rejected; the subgroup
model (predictor and group membership) explained
less variance than the full model (ΔR² = -.0187, F =
235.34, p < .001). Following the analytic logic illustrated
in Figure 1, the full and crossproduct models
were next compared to determine if the intercepts
were different for men and women. The null hypothesis
of same intercepts was also rejected, with removal
of SEX leading to a significant reduction in the
amount of explained variance (ΔR² = -.0181, F = 228.41,
p < .001). Overall, the results obtained with the
corrected correlations indicated the need for separate
regression equations for men and women for predictions
based on both raw and standardized predictor
scores. Therefore, correlations between TMC and
SCREEN were computed for men and women separately,
corrected for explicit restriction in range based
on the standard deviation of the aptitude by sex
(TMC_SEX) interaction term, and submitted to regression
analysis. The equation for men was:

SCREEN’ = -24.8680 + (1.0597 * TMC)

compared to an equation for women of

SCREEN’ = -23.1254 + (1.0102 * TMC)

where  SCREEN’  is  the  predicted  score  in  the  FAA 
Academy  ATCS  Nonradar  Screen.  The  regression 
equations  are  plotted  in  Figure  4. 
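
To make the two regression equations concrete, the following sketch evaluates both at a few TMC values and finds the score at which the lines cross. The coefficients are those reported above; the TMC values and the target SCREEN score of 70 are illustrative assumptions, not figures taken from this section, and the function names are hypothetical.

```python
# (intercept, slope) pairs from the reported regression equations
MEN = (-24.8680, 1.0597)
WOMEN = (-23.1254, 1.0102)


def predict_screen(tmc, coeffs):
    """Predicted SCREEN score for a given TMC under one regression line."""
    intercept, slope = coeffs
    return intercept + slope * tmc


def crossover_tmc(a, b):
    """TMC at which regression lines a and b give the same prediction."""
    return (b[0] - a[0]) / (a[1] - b[1])


def tmc_needed(target, coeffs):
    """TMC at which the predicted SCREEN score equals the target."""
    intercept, slope = coeffs
    return (target - intercept) / slope


for tmc in (70, 80, 90, 100):  # illustrative TMC values
    men = predict_screen(tmc, MEN)
    women = predict_screen(tmc, WOMEN)
    print(f"TMC {tmc}: men {men:.2f}, women {women:.2f}, gap {men - women:.2f}")

print(f"lines cross at TMC = {crossover_tmc(MEN, WOMEN):.1f}")

TARGET = 70.0  # hypothetical target SCREEN score, for illustration only
print(f"TMC for predicted SCREEN of {TARGET}: "
      f"men {tmc_needed(TARGET, MEN):.1f}, women {tmc_needed(TARGET, WOMEN):.1f}")
```

Above the crossing point the line for men lies above the line for women, so the gap between the two predictions widens as TMC increases.
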

DISCUSSION 

Overall, the analyses reported in this study indicated
that the written ATCS aptitude test battery did
not fulfill the technical fairness requirements outlined
by the Uniform Guidelines on Employee Selection Procedures.
The results of the adverse impact analysis
indicated that use of the weighted composite of MCAT
and ABSR scores as a qualification criterion resulted
in the exclusion of greater proportions of women than
men from further consideration for employment.
Moreover, the adverse impact could be attributed to a
specific practice (Antonio v. Ward's Cove Packing Co.,
1989; EEOC v. Greyhound Lines, 1980; Pouncy v.
Prudential Insurance Co., 1982) and was statistically
significant (Hazelwood School District v. United States,
1977). Similarly, there appeared to be subtly different
relationships for the sexes between aptitude score and
subsequent performance at the Academy, after correcting
the sample data correlations for restriction in
range. The corrected majority regression line slightly
overpredicted the performance of the minority group,
as shown in Figure 4. Schmidt (1988) suggested that
this is a common finding in differential prediction
analyses. For example, Dunbar and Novick (1988)
reported similar results for predictions of training
success from the Armed Services Vocational Aptitude
Battery (ASVAB) scores by gender.

Figure 4

Regression functions by gender in research sample
[Axis label: SCREEN]

Evidence for differential prediction such as we
found in this study of the ATCS written aptitude test
battery has been discounted on the basis of factors
such as use of inappropriate statistical procedures and
defects in study designs (Hunter, 1973). On the one
hand, the statistical effects detected in our differential
prediction analyses were generally small and detectable
only with very large samples, after corrections for
restriction in range. One might argue that, as a consequence,
the results were artifacts of an inappropriate
analysis of corrected correlations, and have little practical
significance. We would counter by noting that
corrected correlations may, in fact, provide more
accurate estimates of test validity, particularly in large
samples and under stringent selection ratios (Bobko,
1983; Millsap, 1988). Uncorrected coefficients appear
to be downwardly biased estimates of the true
population validity coefficients (Lee, Miller, & Graham,
1982). Therefore, differential prediction analyses
based on corrected correlations that provide less
biased estimates of true population values are likely to
provide similarly less biased estimates of population
effects, and are not artifactual. Moreover, we believe
that these effects cannot be lightly dismissed in view
of the very real practical consequences for the ATCS
selection program. One practical consequence of a
mean score difference on TMC of 0.35 SD was adverse
impact on women, as defined under the Uniform
Guidelines. In addition, the practical consequence of
the apparent differential prediction in the population
was that women may have effectively needed a higher
TMC than men to have an equal likelihood of passing
the FAA Academy. The implications of differential
prediction relative to the consequences of overprediction
will be investigated in greater detail in
another study.
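
The size of the correction at issue can be illustrated with the usual formula for explicit (direct) restriction in range, often called Thorndike's Case 2. This section does not reproduce the formula actually used, so treating it as Case 2 is an assumption, as are the function names below. The sketch also backs out the ratio of unrestricted to restricted predictor standard deviations that this formula would need in order to move the reported correlation of .1844 to .4724; that ratio is derived here for illustration and is not a value reported in the text.

```python
import math


def correct_explicit_restriction(r, u):
    """Correct r for explicit restriction in range (Thorndike Case 2).

    r is the correlation in the restricted sample and u is the ratio of
    the unrestricted to the restricted standard deviation of the predictor.
    """
    return (r * u) / math.sqrt(1.0 + r * r * (u * u - 1.0))


def implied_sd_ratio(r, r_corrected):
    """SD ratio u that maps r onto r_corrected under the Case 2 formula."""
    numerator = r_corrected ** 2 * (1.0 - r ** 2)
    denominator = r ** 2 * (1.0 - r_corrected ** 2)
    return math.sqrt(numerator / denominator)


r_sample = 0.1844      # uncorrected TMC-SCREEN correlation (reported)
r_population = 0.4724  # corrected TMC-SCREEN correlation (reported)

u = implied_sd_ratio(r_sample, r_population)  # derived, not reported
print(f"implied SD ratio: {u:.2f}")
print(f"corrected r: {correct_explicit_restriction(r_sample, u):.4f}")
```

With u = 1 (no restriction) the formula returns the sample correlation unchanged; the more severely the predictor is restricted, the larger the upward adjustment.
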

On the other hand, unmeasured variables may have
been confounded with the predictor, resulting in a
defective study design (Anastasi, 1988). One might
suspect, for example, that education and scores on the
aptitude test might be confounded in view of the
generally positive correlation between such tests and
educational attainment: the group with lower scores
on an aptitude test battery might have lower overall
educational levels than the other group. However, a
significantly greater proportion of women (39.3% of
34,479) than men (35.7% of 118,735) had achieved
a baccalaureate degree or more in the reference population
of 170,578 applicants (Z = 12.22, p < .001). A
similar pattern was found for the sample of 9,552 FAA
Academy entrants used in the differential prediction
analysis. These data provide some evidence to suggest,
pending more detailed analyses, that unmeasured
variables such as education may not account for the
observed differential prediction in this study.
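
The comparison of educational attainment can be checked from the reported percentages and group sizes. The sketch below assumes a pooled two-proportion z test; the text does not say which test was used, and the function name is illustrative. Because the percentages are rounded to one decimal, the result should be close to, but need not exactly equal, the reported Z.

```python
import math


def two_proportion_z(p1, n1, p2, n2):
    """Pooled z statistic for the difference between two proportions."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    standard_error = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    return (p1 - p2) / standard_error


# Proportion with a baccalaureate degree or more, as reported in the text
z = two_proportion_z(0.393, 34479, 0.357, 118735)  # women v. men
print(f"Z = {z:.2f}")
```
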

There is an alternative explanation to conclusions
of test bias or artifactual results due to statistical
corrections or unmeasured variables. The results might
accurately reflect true differences in capabilities and
performance by gender. One recent analysis of the
ATCS job found that perceptual processes, such as
visualization and scanning, are important worker
requirements (Nickels, Bobko, Blair, Sands, & Tartak,
1995). Another analysis suggested that perceptual speed
and reasoning with numerical information were cognitive
attributes relevant to the controller job (Broach
& Aul, in preparation). An analysis of the abilities
required specifically for success in the Screen also
pointed toward the visual-spatial domain of abilities
(Gibb, Smith, Swindells, Tyson, Gieraltowski,
Petschauer, & Haney, 1991). These studies indicate
that there is at least some need to utilize abilities in the
visual-spatial domain in the performance of ATCS
tasks. The construct validity study of the OPM test
battery conducted by Harris (1986) provided evidence
that the MCAT, in particular, measured some
aspects of this domain of job-relevant abilities with its
emphasis on the definition and manipulation of the
perceptual field and reasoning with verbal and numeric
information. There appear to be subtle but
persistent sex differences in the visual-spatial abilities
domain (Halpern, 1986; Voyer, Voyer, & Bryden,
1995). Sex differences in the abilities measured by the
ATCS written aptitude test battery might explain, in
part, the mean score differences observed on the
predictor TMC. Similarly, sex differences on visual-spatial
abilities important to performance in the Screen
might account for the apparent differential
prediction of Screen scores from aptitude scores.
Current research being conducted under the FAA
Separation and Control Hiring Assessment (SACHA)
project (Bobko, Nickels, Blair, & Tartak, 1994; University
Research Corporation, 1994) may provide
further data elucidating the relationships between
gender, visual-spatial abilities as measured by aptitude
tests, and ATCS job performance.